Let’s import the cleaned dataset that we created.
Mountain_data_cleaned <- read.csv("../data/Mountain_data_cleaned.csv")
Some of the models we use do not work with NAs. To deal with them, we decided to replace the two NAs with the mean of their closest observations (the Glu_P values of the neighbouring plots 76 and 77).
mean_for_fill <- colMeans(Mountain_data_cleaned %>%
select(Plot, Glu_P) %>%
filter(Plot %in% c(76, 77)) %>%
na.omit() %>%
select(Glu_P))
Mountain_data_cleaned[is.na(Mountain_data_cleaned)] <- mean_for_fill
rm(mean_for_fill)
set.seed(123) ## for reproducibility
## the index of the rows that will be in the training set
index.tr <- sample(1:nrow(Mountain_data_cleaned), replace=FALSE,
size=0.75*nrow(Mountain_data_cleaned))
Mountain_data.tr_notsubs <- Mountain_data_cleaned[index.tr,] ## the training set
Mountain_data.te <- Mountain_data_cleaned[-index.tr,] ## the test set
As discussed in the EDA part, we should balance our data because we do not have the same amount of information on each mountain range: Sierra de Guadarrama has the most observations, while Central Andes has about half as many.
no.mountain_1 <- min(table(Mountain_data.tr_notsubs$Mountain_range)) ## 79
## the "Central Andes" cases
data.tr.mountain_1 <- filter(Mountain_data.tr_notsubs, Mountain_range=="Central Andes")
## the "Central Pyrenees" cases
data.tr.mountain_2 <- filter(Mountain_data.tr_notsubs, Mountain_range=="Central Pyrenees")
## The "Sierra de Guadarrama" cases
data.tr.mountain_3 <- filter(Mountain_data.tr_notsubs, Mountain_range=="Sierra de Guadarrama")
## sub-sample 79 instances from the "Central Pyrenees" cases
index.mountain_2 <- sample(size=no.mountain_1,
x=1:nrow(data.tr.mountain_2),
replace=FALSE)
## sub-sample 79 instances from the "Sierra de Guadarrama" cases
index.mountain_3 <- sample(size=no.mountain_1,
x=1:nrow(data.tr.mountain_3),
replace=FALSE)
## Bind all the "Central Andes" and the sub-sampled "Central Pyrenees"
## and the sub-sampled "Sierra de Guadarrama"
Mountain_data.tr <- data.frame(rbind(data.tr.mountain_1,
data.tr.mountain_2[index.mountain_2,],
data.tr.mountain_3[index.mountain_3,]))
## The cases are now balanced
table(Mountain_data.tr$Mountain_range)
##
## Central Andes Central Pyrenees Sierra de Guadarrama
## 79 79 79
Simple hyperparameter tuning; this code takes some time to run.
set.seed(1)
fitControl <- trainControl(method = "cv",
number = 10)
nnetGrid <- expand.grid(size = seq(from = 1, to = 6, by = 1),
decay = seq(from = 0.1, to = 0.5, by = 0.1))
nnetFit <- train(Mountain_range ~ .,
data = Mountain_data.tr,
method = "nnet",
metric = "Accuracy",
tuneGrid = nnetGrid,
trControl = fitControl)
plot(nnetFit)
According to the tuning, the best neural network has 4 units in its (single) hidden layer, with a decay of 0.1.
We now fit the neural network manually with these parameters.
set.seed(345)
nn4 <- nnet(Mountain_range ~ ., data=Mountain_data.tr, size=4, decay = 0.1)
## # weights: 747
## initial value 343.739521
## iter 10 value 260.410032
## iter 20 value 257.569875
## iter 30 value 255.306821
## iter 40 value 230.131410
## iter 50 value 63.237367
## iter 60 value 49.787755
## iter 70 value 49.290640
## iter 80 value 33.426478
## iter 90 value 19.428864
## iter 100 value 15.167996
## final value 15.167996
## stopped after 100 iterations
nn4_pred <- predict(nn4, Mountain_data.te, type="class")
tab4 <- table(Obs=Mountain_data.te$Mountain_range, Pred=nn4_pred) # confusion matrix
tab4
## Pred
## Obs Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 21 0 0
## Central Pyrenees 1 28 0
## Sierra de Guadarrama 0 0 58
(acc4 <- sum(diag(tab4))/sum(tab4)) # accuracy
## [1] 0.9907407
The model achieves near-perfect accuracy on the test set (99.1%).
# Confusion Matrix
confusionMatrix(data=as.factor(nn4_pred), reference = Mountain_data.te$Mountain_range)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 21 1 0
## Central Pyrenees 0 28 0
## Sierra de Guadarrama 0 0 58
##
## Overall Statistics
##
## Accuracy : 0.9907
## 95% CI : (0.9495, 0.9998)
## No Information Rate : 0.537
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9846
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Central Andes Class: Central Pyrenees
## Sensitivity 1.0000 0.9655
## Specificity 0.9885 1.0000
## Pos Pred Value 0.9545 1.0000
## Neg Pred Value 1.0000 0.9875
## Prevalence 0.1944 0.2685
## Detection Rate 0.1944 0.2593
## Detection Prevalence 0.2037 0.2593
## Balanced Accuracy 0.9943 0.9828
## Class: Sierra de Guadarrama
## Sensitivity 1.000
## Specificity 1.000
## Pos Pred Value 1.000
## Neg Pred Value 1.000
## Prevalence 0.537
## Detection Rate 0.537
## Detection Prevalence 0.537
## Balanced Accuracy 1.000
The confusionMatrix() command gives us more information on the model. As said before, the accuracy is very high (99.1%), and the per-class balanced accuracies are all close to 1 (0.994, 0.983 and 1.000), the maximum we can get, which suggests that the model does not suffer from the imbalance of the test set.
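As a sanity check, the per-class balanced accuracy reported above can be recomputed by hand from the confusion matrix. This is a sketch: `balanced_acc` is a hypothetical helper (not part of caret), and `tab4` is the confusion matrix computed earlier, with observations in rows and predictions in columns.

```r
# hypothetical helper: balanced accuracy = (sensitivity + specificity) / 2
balanced_acc <- function(tab, class) {
  sens <- tab[class, class] / sum(tab[class, ])                      # true positive rate
  tn   <- sum(tab[rownames(tab) != class, colnames(tab) != class])   # true negatives
  spec <- tn / sum(tab[rownames(tab) != class, ])                    # true negative rate
  (sens + spec) / 2
}
balanced_acc(tab4, "Central Andes")  # ~0.994, matching confusionMatrix()
```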
train_set_for_RF <- Mountain_data.tr %>%
select(!c(Plot, Subplot, Date, Day, Month, Year, Locality, Country))
test_set_for_RF <- Mountain_data.te %>%
select(!c(Plot, Subplot, Date, Day, Month, Year, Locality, Country))
rf <- randomForest(Mountain_range ~ ., data=train_set_for_RF, importance = TRUE)
predtr <- predict(rf, newdata = train_set_for_RF[-1])
predte <- predict(rf, newdata = test_set_for_RF[-1])
cmtr <- table(train_set_for_RF[,1], predtr)
cmte <- table(test_set_for_RF[,1], predte)
varImpPlot(rf, cex = 0.65)
importance(rf)
## Central Andes Central Pyrenees Sierra de Guadarrama
## Radiation 3.13737888 2.303583 2.6745117
## Phos_P 3.16069796 5.801423 6.5140634
## Glu_P 1.60137498 1.398899 1.6249260
## SOC_P 4.01423103 4.188059 4.1334148
## NT_P 5.11591652 3.316876 3.7355699
## PT_P 2.95677261 4.311597 1.5562930
## K_P -0.09619991 1.331178 3.4685762
## pH_P 15.05352536 13.020120 16.9881797
## Cond_P 4.60393107 5.155277 3.3775076
## Phos_B 4.27608539 6.442918 6.0405404
## Glu_B 2.26558677 1.799629 3.7644969
## SOC_B 7.40221740 2.381187 7.6427334
## NT_B 9.01514124 5.587019 7.8622947
## PT_B 3.33835698 1.857367 1.6481046
## K_B 3.42100795 2.820063 2.6648140
## pH_B 16.49696920 15.863898 14.1175546
## Cond_B 10.68856356 10.045402 4.9915864
## Phos_T 4.19140393 7.942887 7.4260798
## Glu_T 1.12669814 2.199217 2.6690536
## SOC_T 7.45558803 2.924291 6.6968826
## NT_T 10.34572042 6.735634 8.2957005
## PT_T 3.23013037 3.002285 -0.1739031
## K_T 2.05614399 2.730476 2.4998274
## pH_T 15.48328130 14.648225 13.9304162
## Cond_T 9.81844426 10.168778 4.8952827
## MeanDecreaseAccuracy MeanDecreaseGini
## Radiation 4.116329 0.3343212
## Phos_P 7.057318 3.7516764
## Glu_P 2.398319 0.1122100
## SOC_P 5.824931 2.4331305
## NT_P 5.711238 2.8770588
## PT_P 4.676746 0.3779176
## K_P 3.013458 0.3661828
## pH_P 17.947560 26.0658672
## Cond_P 6.498981 2.7284880
## Phos_B 7.853608 4.2337452
## Glu_B 4.371594 1.6683522
## SOC_B 8.984614 6.3292244
## NT_B 9.975941 8.5198157
## PT_B 3.375704 0.2074085
## K_B 4.424175 0.6111154
## pH_B 18.786750 25.8592483
## Cond_B 11.249806 9.8605651
## Phos_T 8.760424 6.6811926
## Glu_T 3.180503 0.9214452
## SOC_T 8.161092 6.6922297
## NT_T 10.338053 10.4797244
## PT_T 3.983457 0.3729806
## K_T 3.381527 0.3913723
## pH_T 18.236854 27.1060341
## Cond_T 11.221627 8.3969217
Acc_tr <- sum(diag(cmtr))/sum(cmtr)
Acc_te <- sum(diag(cmte))/sum(cmte)
From the variable importance we see that the variables pH_P, pH_T and pH_B are very important. We can therefore say that pH in general is an important predictor.
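This can be checked directly on the importance matrix returned above (a small sketch using the `rf` object and `importance()` output already computed):

```r
# mean decrease in Gini for the three pH variables --
# these are the three largest values in the whole importance matrix
imp <- importance(rf)
imp[c("pH_P", "pH_B", "pH_T"), "MeanDecreaseGini"]
```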
# Confusion Matrix of the Random Forest model on training set
cmtr
## predtr
## Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 79 0 0
## Central Pyrenees 0 79 0
## Sierra de Guadarrama 0 0 79
Acc_tr
## [1] 1
# Confusion Matrix of the Random Forest model on test set
cmte
## predte
## Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 20 1 0
## Central Pyrenees 0 29 0
## Sierra de Guadarrama 0 0 58
Acc_te
## [1] 0.9907407
We see that the model has an accuracy of 99% when predicting new observations.
The analysis of the Naive Bayes model comes down to analysing the density plots of each variable by mountain range. These density plots agree with what has been seen before: pH is a very good feature for classifying the mountains. Indeed, pH_P, pH_B and pH_T are intermediate in the Central Andes, lower in the Sierra de Guadarrama and higher in the Central Pyrenees.
In addition, other variables such as the phosphatase enzyme (Phos_P, Phos_B, Phos_T), the β-glucosidase enzyme (Glu_P, Glu_B, Glu_T), soil organic carbon (SOC_P, SOC_B, SOC_T), soil total nitrogen (NT_P, NT_B, NT_T), electrical conductivity (Cond_P, Cond_B, Cond_T) and radiation give an idea of which mountain we are on. For instance, a radiation below 0.6 points to the Central Pyrenees. Electrical conductivity distinguishes the Central Andes from the Central Pyrenees, being higher for the latter. Soil total nitrogen is lower in the Central Andes than in the other two ranges. Soil organic carbon separates the Central Andes from the Sierra de Guadarrama, being higher for the latter. The β-glucosidase enzyme is present at about the same level in the Central Andes and the Central Pyrenees but at a higher level in the Sierra de Guadarrama; roughly the same holds for the phosphatase enzyme.
However, some variables do not help determine the mountain at all; this is the case for the phosphorus (PT_P, PT_B, PT_T) and the potassium content (K_P, K_B, K_T).
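For completeness, such a model can be fitted with `naiveBayes()` from the e1071 package. This is a sketch, assuming the same balanced training set and test set as above; it is not the exact call used in the report.

```r
library(e1071)

# fit a (Gaussian) naive Bayes classifier on the balanced training set
nb <- naiveBayes(Mountain_range ~ ., data = Mountain_data.tr)

# predicted classes on the test set and the resulting confusion matrix
nb_pred <- predict(nb, newdata = Mountain_data.te)
table(Obs = Mountain_data.te$Mountain_range, Pred = nb_pred)
```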
We use a 2-NN fitted on the training set. Note that the predictions below are also made on the training set, which is why the confusion matrix is perfect.
set.seed(123)
KNN <- knn3(Mountain_range ~ ., data=Mountain_data.tr, k=2)
MR.tr.pred <- predict(KNN, newdata = Mountain_data.tr, type="class")
TAB <- table(Obs=Mountain_data.tr$Mountain_range, Pred=MR.tr.pred) # confusion matrix
TAB
## Pred
## Obs Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 79 0 0
## Central Pyrenees 0 79 0
## Sierra de Guadarrama 0 0 79
(ACC <- sum(diag(TAB))/sum(TAB)) # accuracy
## [1] 1
total <- cbind(Mountain_data_cleaned$Mountain_range, Mountain_data.num)
total <- total %>% na.omit()
pc <- princomp(total[,2:26], cor=TRUE, scores=TRUE)
summary(pc)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.3066874 2.2052878 1.6416258 1.26611701 1.02192668
## Proportion of Variance 0.4373673 0.1945318 0.1077974 0.06412209 0.04177337
## Cumulative Proportion 0.4373673 0.6318990 0.7396964 0.80381853 0.84559189
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.97938528 0.83007360 0.70896429 0.64078160 0.57997005
## Proportion of Variance 0.03836782 0.02756089 0.02010521 0.01642404 0.01345461
## Cumulative Proportion 0.88395971 0.91152060 0.93162582 0.94804986 0.96150447
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.5045344 0.415942405 0.37188540 0.324735856 0.258563397
## Proportion of Variance 0.0101822 0.006920323 0.00553195 0.004218135 0.002674201
## Cumulative Proportion 0.9716867 0.978606991 0.98413894 0.988357076 0.991031277
## Comp.16 Comp.17 Comp.18 Comp.19
## Standard deviation 0.234410086 0.215792220 0.176906812 0.17192440
## Proportion of Variance 0.002197924 0.001862651 0.001251841 0.00118232
## Cumulative Proportion 0.993229201 0.995091852 0.996343693 0.99752601
## Comp.20 Comp.21 Comp.22 Comp.23
## Standard deviation 0.1420714868 0.1274096926 0.1128208099 0.0833230454
## Proportion of Variance 0.0008073723 0.0006493292 0.0005091414 0.0002777092
## Cumulative Proportion 0.9983333849 0.9989827141 0.9994918555 0.9997695647
## Comp.24 Comp.25
## Standard deviation 0.0594082468 4.723920e-02
## Proportion of Variance 0.0001411736 8.926167e-05
## Cumulative Proportion 0.9999107383 1.000000e+00
The first 2 components cover 63% of the total variance; with 5 dimensions we reach 84.6%.
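These percentages can be read off programmatically from the component standard deviations (a quick check on the `pc` object above, reproducing the "Cumulative Proportion" row of `summary(pc)`):

```r
# cumulative proportion of variance explained by the principal components
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
round(cum_var[1:5], 3)  # first two components ~0.63, first five ~0.85
```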
plot(pc, type="lines")
As seen in the EDA, we can retain 5 dimensions. In the following graph we project the observations onto the first 3 principal components, coloured by mountain range; clusters may be apparent.
X <- Mountain_data.num %>% na.omit()
prin_comp <- prcomp(X, rank. = 3)
components <- prin_comp[["x"]]
components <- data.frame(components)
components$PC2 <- -components$PC2
components$PC3 <- -components$PC3
components = cbind(components, Mountain_range = total[, 1])
tit = 'Total Explained Variance = 74.03%'
fig <- plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Mountain_range, colors = c('#636EFA','#EF553B','#00CC96') ) %>%
add_markers(size = 12)
fig <- fig %>%
layout(
title = tit,
scene = list(bgcolor = "#e5ecf6")
)
fig
mountain.hc <- hclust(dist(total, method = "manhattan"))
mountain.hc
##
## Call:
## hclust(d = dist(total, method = "manhattan"))
##
## Cluster method : complete
## Distance : manhattan
## Number of objects: 430
mountain.clust <- cutree(mountain.hc, k = 4)
mountain.clust
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 1 1 1 1 3 3 3 3 3 2 2 2 2 2 1 1 1 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 1 1 1 1 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 1 1 1 1 1 2 1 1 1 1 1 2 2 2 2 1 1 2 1 1
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 2 2 2 2 2 2 2 2 2 2 4 4 2 4 4 2 4 2 2 2
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 2 2 2 4 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 421 422 423 424 425 426 427 428 429 430
## 1 1 1 1 1 1 1 1 1 1
Mountain_data.pca <- prcomp(na.omit(Mountain_data.num), center = TRUE, scale = TRUE)
fviz_pca_biplot(Mountain_data.pca,
col.ind = factor(mountain.clust))
We now want to replicate the above models with another training set and test set obtained through a bootstrapping method. To do so, we first delete all the repeated observations from the original cleaned dataset.
We remove every replicated row with the unique() function.
Mountain_df <- Mountain_data_cleaned[-c(4:9)]
Mountain_df <- unique(Mountain_df)
ggplot(Mountain_df) +
geom_bar(aes(x = Mountain_range))
Here we see the new unique dataset and notice that it is heavily unbalanced: the mountain range that previously had the most observations now has the fewest, and therefore the least information.
set.seed(123) ## for reproducibility
# the index of the rows that will be in the training set
index.tr <- sample(1:nrow(Mountain_df), replace=FALSE,
size=0.75*nrow(Mountain_df))
Mountain_df.tr_notsubs <- Mountain_df[index.tr,] ## the training set
Mountain_df.te <- Mountain_df[-index.tr,] ## the test set
We can now proceed to the bootstrapping with 100 replicates.
set.seed(897)
index.boot <- createResample(y=Mountain_df.tr_notsubs$Mountain_range, times=100)
head(index.boot[[1]])
## [1] 1 2 3 4 5 5
tail(index.boot[[1]])
## [1] 198 198 199 200 201 204
df.boot.tr <- Mountain_df.tr_notsubs[index.boot[[1]],]
dim(df.boot.tr)
## [1] 205 28
df.boot.val <- Mountain_df.tr_notsubs[-index.boot[[1]],]
dim(df.boot.val)
## [1] 73 28
# The class counts of the bootstrapped data frame
ggplot(df.boot.tr) +
geom_bar(aes(x = Mountain_range))
We can now refit the models on the bootstrapped training set.
Simple hyperparameter tuning again; this code takes some time to run.
set.seed(1)
nnetFit_boot <- train(Mountain_range ~ .,
data = df.boot.tr,
method = "nnet",
metric = "Accuracy",
tuneGrid = nnetGrid,
trControl = fitControl)
plot(nnetFit_boot)
The tuning again selects a neural network with 4 hidden units, as before. Note that the model below is fitted with a decay of 0.3.
set.seed(345)
nn4_boot <- nnet(Mountain_range ~ ., data=df.boot.tr, size=4, decay = 0.3)
## # weights: 375
## initial value 284.692893
## iter 10 value 116.079753
## iter 20 value 57.538311
## iter 30 value 29.136661
## iter 40 value 22.652969
## iter 50 value 21.197795
## iter 60 value 20.339347
## iter 70 value 18.971778
## iter 80 value 18.386184
## iter 90 value 18.351844
## iter 100 value 18.341811
## final value 18.341811
## stopped after 100 iterations
nn4_pred_boot <- predict(nn4_boot, df.boot.val, type="class")
confusionMatrix(data=as.factor(nn4_pred_boot),
reference = df.boot.val$Mountain_range)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 24 0 0
## Central Pyrenees 0 39 0
## Sierra de Guadarrama 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9507, 1)
## No Information Rate : 0.5342
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Central Andes Class: Central Pyrenees
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.3288 0.5342
## Detection Rate 0.3288 0.5342
## Detection Prevalence 0.3288 0.5342
## Balanced Accuracy 1.0000 1.0000
## Class: Sierra de Guadarrama
## Sensitivity 1.000
## Specificity 1.000
## Pos Pred Value 1.000
## Neg Pred Value 1.000
## Prevalence 0.137
## Detection Rate 0.137
## Detection Prevalence 0.137
## Balanced Accuracy 1.000
row.names(Mountain_data_cleaned) <- paste("M", c(1:nrow(Mountain_data_cleaned)), sep="") # row names are used after
# scaling the data
Mountain_data_cleaned[,-c(1:9)] <- scale(Mountain_data_cleaned[,-c(1:9)])
# matrix of Manhattan distances
mountain.d <- dist(Mountain_data_cleaned[,-c(1:9)], method = "manhattan")
# create a data frame of the distances in long format
mountain.melt <- melt(as.matrix(mountain.d))
head(mountain.melt)
## Var1 Var2 value
## 1 M1 M1 0.00000
## 2 M2 M1 0.00000
## 3 M3 M1 0.00000
## 4 M4 M1 0.00000
## 5 M5 M1 0.00000
## 6 M6 M1 24.66131
ggplot(data = mountain.melt, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
# dendrogram using complete linkage
mountain.hc <- hclust(mountain.d, method = "complete")
plot(mountain.hc, hang=-1)
# display the tree cut into 4 clusters
plot(mountain.hc, hang=-1)
rect.hclust(mountain.hc, k=4)
mountain.clust <- cutree(mountain.hc, k=5)
mountain.clust
## M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15 M16
## 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1
## M17 M18 M19 M20 M21 M22 M23 M24 M25 M26 M27 M28 M29 M30 M31 M32
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M33 M34 M35 M36 M37 M38 M39 M40 M41 M42 M43 M44 M45 M46 M47 M48
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M49 M50 M51 M52 M53 M54 M55 M56 M57 M58 M59 M60 M61 M62 M63 M64
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M65 M66 M67 M68 M69 M70 M71 M72 M73 M74 M75 M76 M77 M78 M79 M80
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M81 M82 M83 M84 M85 M86 M87 M88 M89 M90 M91 M92 M93 M94 M95 M96
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## M97 M98 M99 M100 M101 M102 M103 M104 M105 M106 M107 M108 M109 M110 M111 M112
## 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1
## M113 M114 M115 M116 M117 M118 M119 M120 M121 M122 M123 M124 M125 M126 M127 M128
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M129 M130 M131 M132 M133 M134 M135 M136 M137 M138 M139 M140 M141 M142 M143 M144
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M145 M146 M147 M148 M149 M150 M151 M152 M153 M154 M155 M156 M157 M158 M159 M160
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M161 M162 M163 M164 M165 M166 M167 M168 M169 M170 M171 M172 M173 M174 M175 M176
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M177 M178 M179 M180 M181 M182 M183 M184 M185 M186 M187 M188 M189 M190 M191 M192
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## M193 M194 M195 M196 M197 M198 M199 M200 M201 M202 M203 M204 M205 M206 M207 M208
## 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3
## M209 M210 M211 M212 M213 M214 M215 M216 M217 M218 M219 M220 M221 M222 M223 M224
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## M225 M226 M227 M228 M229 M230 M231 M232 M233 M234 M235 M236 M237 M238 M239 M240
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## M241 M242 M243 M244 M245 M246 M247 M248 M249 M250 M251 M252 M253 M254 M255 M256
## 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 3
## M257 M258 M259 M260 M261 M262 M263 M264 M265 M266 M267 M268 M269 M270 M271 M272
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## M273 M274 M275 M276 M277 M278 M279 M280 M281 M282 M283 M284 M285 M286 M287 M288
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## M289 M290 M291 M292 M293 M294 M295 M296 M297 M298 M299 M300 M301 M302 M303 M304
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## M305 M306 M307 M308 M309 M310 M311 M312 M313 M314 M315 M316 M317 M318 M319 M320
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## M321 M322 M323 M324 M325 M326 M327 M328 M329 M330 M331 M332 M333 M334 M335 M336
## 3 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5
## M337 M338 M339 M340 M341 M342 M343 M344 M345 M346 M347 M348 M349 M350 M351 M352
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## M353 M354 M355 M356 M357 M358 M359 M360 M361 M362 M363 M364 M365 M366 M367 M368
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## M369 M370 M371 M372 M373 M374 M375 M376 M377 M378 M379 M380 M381 M382 M383 M384
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## M385 M386 M387 M388 M389 M390 M391 M392 M393 M394 M395 M396 M397 M398 M399 M400
## 5 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5
## M401 M402 M403 M404 M405 M406 M407 M408 M409 M410 M411 M412 M413 M414 M415 M416
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## M417 M418 M419 M420 M421 M422 M423 M424 M425 M426 M427 M428 M429 M430
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5
#Interpretation of the clusters
mountain.comp <- data.frame(Mountain_data.num,
Clust=factor(mountain.clust)
,Id=row.names(Mountain_data_cleaned))
mountain.df <- melt(mountain.comp, id=c("Id", "Clust"))
head(mountain.df)
## Id Clust variable value
## 1 M1 1 Radiation 0.8088464
## 2 M2 1 Radiation 0.8088464
## 3 M3 1 Radiation 0.8088464
## 4 M4 1 Radiation 0.8088464
## 5 M5 1 Radiation 0.8088464
## 6 M6 2 Radiation 0.7879971
ggplot(mountain.df, aes(y=value, group=Clust, fill=Clust)) +
geom_boxplot() +
facet_wrap(~variable, ncol=4)
#number of clusters
fviz_nbclust(Mountain_data_cleaned[,-c(1:9)],
hcut, hc_method="complete",
hc_metric="manhattan",
method = "silhouette",
k.max = 25, verbose = FALSE)
mountain.pam <- pam(Mountain_data_cleaned[,-c(1:9)], k=5)
mountain.pam
## Medoids:
## ID Radiation Phos_P Glu_P SOC_P NT_P PT_P
## M141 141 -0.03267093 0.3045220 0.5926840 -0.03088487 -0.3451440 -0.49409941
## M10 10 0.45536646 0.8451473 1.0224398 0.34835885 0.2399536 4.00384317
## M115 115 0.40849005 1.0207699 0.3845950 0.93075989 0.8733244 -0.02563892
## M231 231 -0.16053254 -0.7836337 -0.4194728 -0.48312210 0.7438410 -0.21702598
## M345 345 0.61475385 -0.9278049 -0.9914553 -0.91414870 -1.2964118 -0.35445886
## K_P pH_P Cond_P Phos_B Glu_B SOC_B
## M141 -0.57305476 -1.0962304 -0.7126877 0.4658142 0.5925751 0.03849061
## M10 -0.22773766 -0.6933468 -0.3154310 0.5962804 0.9344154 0.12707645
## M115 -0.54962348 -1.2733935 -0.4991138 0.9836193 0.6924069 0.72350793
## M231 0.07564058 1.0649627 0.6859244 -0.5120815 -0.2591886 -0.41874832
## M345 0.31951086 0.2374789 -0.8427737 -1.0701228 -1.0747579 -0.97663211
## NT_B PT_B K_B pH_B Cond_B Phos_T
## M141 -0.3137321 -0.48388312 -0.6524294 -1.06489899 -0.64025123 0.2466561
## M10 0.1699264 3.98446315 -0.5489790 -0.42773019 -0.45433535 1.3444510
## M115 0.4105123 0.08044399 -0.6885907 -1.00903348 -0.08134252 1.1212807
## M231 0.4288246 -0.34792356 0.4825208 1.24068573 0.12136046 -0.6388680
## M345 -1.3550670 -0.46576956 -0.5070546 -0.01440091 -0.88139037 -1.0322060
## Glu_T SOC_T NT_T PT_T K_T pH_T
## M141 0.4864576 -0.09301931 -0.2889095 -0.4978368 -0.7313626 -1.03402477
## M10 1.6150132 0.70790908 0.7369885 5.7421401 0.3480502 -0.54822964
## M115 0.8822342 0.92818134 0.6674018 -0.1018243 -0.5895893 -1.20342837
## M231 -0.4456760 -0.50093012 0.3758693 -0.3097234 0.3664316 1.20662937
## M345 -1.0636267 -0.94240733 -1.3234062 -0.4174604 -0.3905646 0.07737753
## Cond_T
## M141 -0.88320505
## M10 -0.03094146
## M115 -0.46360706
## M231 0.35777356
## M345 -1.14094739
## Clustering vector:
## M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15 M16
## 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1
## M17 M18 M19 M20 M21 M22 M23 M24 M25 M26 M27 M28 M29 M30 M31 M32
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
## M33 M34 M35 M36 M37 M38 M39 M40 M41 M42 M43 M44 M45 M46 M47 M48
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## M49 M50 M51 M52 M53 M54 M55 M56 M57 M58 M59 M60 M61 M62 M63 M64
## 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1
## M65 M66 M67 M68 M69 M70 M71 M72 M73 M74 M75 M76 M77 M78 M79 M80
## 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## M81 M82 M83 M84 M85 M86 M87 M88 M89 M90 M91 M92 M93 M94 M95 M96
## 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 2
## M97 M98 M99 M100 M101 M102 M103 M104 M105 M106 M107 M108 M109 M110 M111 M112
## 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3
## M113 M114 M115 M116 M117 M118 M119 M120 M121 M122 M123 M124 M125 M126 M127 M128
## 3 3 3 3 3 3 3 3 1 1 1 1 1 3 3 3
## M129 M130 M131 M132 M133 M134 M135 M136 M137 M138 M139 M140 M141 M142 M143 M144
## 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1
## M145 M146 M147 M148 M149 M150 M151 M152 M153 M154 M155 M156 M157 M158 M159 M160
## 1 1 1 1 1 1 3 3 3 3 3 1 1 1 1 1
## M161 M162 M163 M164 M165 M166 M167 M168 M169 M170 M171 M172 M173 M174 M175 M176
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1
## M177 M178 M179 M180 M181 M182 M183 M184 M185 M186 M187 M188 M189 M190 M191 M192
## 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3
## M193 M194 M195 M196 M197 M198 M199 M200 M201 M202 M203 M204 M205 M206 M207 M208
## 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## M209 M210 M211 M212 M213 M214 M215 M216 M217 M218 M219 M220 M221 M222 M223 M224
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## M225 M226 M227 M228 M229 M230 M231 M232 M233 M234 M235 M236 M237 M238 M239 M240
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## M241 M242 M243 M244 M245 M246 M247 M248 M249 M250 M251 M252 M253 M254 M255 M256
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## M257 M258 M259 M260 M261 M262 M263 M264 M265 M266 M267 M268 M269 M270 M271 M272
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## M273 M274 M275 M276 M277 M278 M279 M280 M281 M282 M283 M284 M285 M286 M287 M288
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## M289 M290 M291 M292 M293 M294 M295 M296 M297 M298 M299 M300 M301 M302 M303 M304
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## M305 M306 M307 M308 M309 M310 M311 M312 M313 M314 M315 M316 M317 M318 M319 M320
## 4 4 5 5 4 5 4 4 4 4 4 4 4 4 4 4
## M321 M322 M323 M324 M325 M326 M327 M328 M329 M330 M331 M332 M333 M334 M335 M336
## 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5
## M337 M338 M339 M340 M341 M342 M343 M344 M345 M346 M347 M348 M349 M350 M351 M352
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## M353 M354 M355 M356 M357 M358 M359 M360 M361 M362 M363 M364 M365 M366 M367 M368
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## M369 M370 M371 M372 M373 M374 M375 M376 M377 M378 M379 M380 M381 M382 M383 M384
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 1 1
## M385 M386 M387 M388 M389 M390 M391 M392 M393 M394 M395 M396 M397 M398 M399 M400
## 5 2 2 2 2 2 5 5 5 5 5 5 5 1 1 5
## M401 M402 M403 M404 M405 M406 M407 M408 M409 M410 M411 M412 M413 M414 M415 M416
## 5 5 3 5 5 5 5 5 5 5 5 5 5 5 5 5
## M417 M418 M419 M420 M421 M422 M423 M424 M425 M426 M427 M428 M429 M430
## 5 5 4 5 5 5 5 4 5 5 5 5 5 5
## Objective function:
## build swap
## 2.970077 2.956713
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
plot(silhouette(mountain.pam))